Data Discription

SOMETHING ABOUT THE INSURANCE DATA

By Python

Python

The following packages are used in Python version of the tutorial.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant

Data Preprocessing

#read the data and save as a dataframe
ds0 = pd.DataFrame(pd.read_csv('~/Desktop/506/groupproject/insurance.csv'))

#have a glimpse of the data
print(ds0.head())
##    age     sex     bmi  children smoker     region      charges
## 0   19  female  27.900         0    yes  southwest  16884.92400
## 1   18    male  33.770         1     no  southeast   1725.55230
## 2   28    male  33.000         3     no  southeast   4449.46200
## 3   33    male  22.705         0     no  northwest  21984.47061
## 4   32    male  28.880         0     no  northwest   3866.85520
print(ds0.columns)

#check if there is any null
## Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
ds0.isnull().sum()
#no null :)

# map the character to factor variable
## age         0
## sex         0
## bmi         0
## children    0
## smoker      0
## region      0
## charges     0
## dtype: int64
ds0.sex = ds0.sex.map({'female':1, 'male':0})
ds0.smoker = ds0.smoker.map({'yes':1, 'no':0})
ds_reg = pd.get_dummies(ds0['region'])

# since we change it to dummy variables, we have to drop one of the column
ds0 = ds0.join(ds_reg.iloc[:,0:3])

Data Visualization

ds0['logcharges'] = np.log(ds0.charges)
ds1 = ds0.drop(['charges','region'],axis=1)

import seaborn as sns
import matplotlib.pyplot as plt
# As we can see, the charges have a heavy tail. So, we take log on it.
sns.distplot(ds0['charges'], color='b')
# Data in every other column looks fine right now.
f,ax = plt.subplots(2,3,figsize=(10,8))
sns.distplot(ds1["age"], kde=False, ax=ax[0,0])
sns.boxplot(x='sex',y='charges', data=ds0, ax=ax[0,1])
sns.distplot(ds1['logcharges'], ax=ax[0,2], kde=False, color='b')
# The logcharges are now normally distributed.
sns.distplot(ds1['bmi'],ax=ax[1,0], kde=False, color='b')
sns.countplot('children',data=ds1, ax=ax[1,1])
sns.countplot('region',data=ds0, ax=ax[1,2])

ax[0,0].set_title('Distribution of Ages')
ax[0,1].set_title('Charges boxplot by Sex')
ax[0,2].set_title('Distribution of log charges')
ax[1,0].set_title('Distribution of bmi')
ax[1,1].set_title('Distribution of children')
ax[1,2].set_title('Distribution of regions')

Optional Step

Actually centralize is not a must, which would not change the general result of the regression. In this example, we won’t do so.

def centralize(arra):
    cen = arra - np.mean(arra)
    var = np.sqrt(sum(cen**2)/(len(arra)-1))
    arra = cen/var
    return arra
ds0.bmi = centralize(ds0.bmi)
ds0.age = centralize(ds0.age)

Model Diagnostics

Firstly, we define dependent variable y and covariates X for future convenience.

y = ds1.logcharges
X = ds1.drop(['logcharges'], axis = 1)

#Conduct the first regression!
regr_1 = OLS(y, add_constant(X)).fit()

VIF

def variance_inflation_factor(exog, exog_idx):
    k_vars = exog.shape[1]
    x_i = exog.iloc[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog.iloc[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif
    
# We skip the constant column.
VIF = [variance_inflation_factor(add_constant(X), i) for i in range(1,X.shape[1]+1)]
print(VIF)
## [1.0168221490038107, 1.0089001621005733, 1.106629732428617, 1.004010642137024, 1.0120736649061481, 1.5262104138696806, 1.524748317955475, 1.5971027909244977]

The vif of each column is ok. All of them are smaller than 5, even 2.

Residual Distribution

sns.distplot(regr_1.resid) 

Acting like normal which is good. Since the residual itself is normal, box-cox is not necessary.

Box-Cox transformation

namda = 0.1
regr_test = OLS((y**namda-1)/namda, add_constant(X)).fit()
sns.jointplot((y**namda-1)/namda, regr_test.resid)
sns.distplot(regr_test.resid)

Residual Plot

If the regression works well, the dependent data should be uncorrelated with the residual. If we have gaussian prerequisite, independence is what we expect.

sns.jointplot(y, regr_1.resid) 

The residual plot looks very strange. Y and residual are highly dependent. Maybe the model is not linear at the first place. Since there is explicit non-linear in this model, we have to add some non-linear covariates in it.

partial residual plot

Which attempts to show how covariate is related to dependent variable, if we control for the effects of all other covariates.

f,ax = plt.subplots(1,2,figsize=(10,8))
sns.jointplot(regr_1.params.bmi * X.bmi + regr_1.resid, X.bmi, ax=ax[0,0])
sns.jointplot(regr_1.params.age * X.age + regr_1.resid, X.age, ax=ax[0,1])

Partial residual plots show that both bmi and age could significantly affect the charges.

Model Selection

For this part, we would try different adding variables and try to drop variables that are useless. The primary concern in this case is that we have to add variables so that the residual is relatively indep with y.

  • Original model

print(regr_1.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             logcharges   R-squared:                       0.768
## Model:                            OLS   Adj. R-squared:                  0.767
## Method:                 Least Squares   F-statistic:                     549.8
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -808.52
## No. Observations:                1338   AIC:                             1635.
## Df Residuals:                    1329   BIC:                             1682.
## Df Model:                           8                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          6.8262      0.076     90.168      0.000       6.678       6.975
## age            0.0346      0.001     39.655      0.000       0.033       0.036
## sex            0.0754      0.024      3.091      0.002       0.028       0.123
## bmi            0.0134      0.002      6.381      0.000       0.009       0.017
## children       0.1019      0.010     10.085      0.000       0.082       0.122
## smoker         1.5543      0.030     51.333      0.000       1.495       1.614
## northeast      0.1290      0.035      3.681      0.000       0.060       0.198
## northwest      0.0652      0.035      1.863      0.063      -0.003       0.134
## southeast     -0.0282      0.034     -0.819      0.413      -0.096       0.039
## ==============================================================================
## Omnibus:                      463.882   Durbin-Watson:                   2.046
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1673.760
## Skew:                           1.679   Prob(JB):                         0.00
## Kurtosis:                       7.330   Cond. No.                         330.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • The first try : add an interactive covariate smoker:bmi

X_2 = X.iloc[:,:]
X_2['sm_bm'] = X_2.smoker * X_2.bmi
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# which certainly improve the performance of the model
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             logcharges   R-squared:                       0.784
## Model:                            OLS   Adj. R-squared:                  0.782
## Method:                 Least Squares   F-statistic:                     534.0
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -762.05
## No. Observations:                1338   AIC:                             1544.
## Df Residuals:                    1328   BIC:                             1596.
## Df Model:                           9                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          7.1128      0.079     90.254      0.000       6.958       7.267
## age            0.0348      0.001     41.281      0.000       0.033       0.036
## sex            0.0871      0.024      3.688      0.000       0.041       0.133
## bmi            0.0034      0.002      1.502      0.133      -0.001       0.008
## children       0.1031      0.010     10.569      0.000       0.084       0.122
## smoker         0.1564      0.146      1.071      0.284      -0.130       0.443
## northeast      0.1375      0.034      4.062      0.000       0.071       0.204
## northwest      0.0664      0.034      1.964      0.050    8.84e-05       0.133
## southeast     -0.0252      0.033     -0.757      0.449      -0.091       0.040
## sm_bm          0.0456      0.005      9.773      0.000       0.036       0.055
## ==============================================================================
## Omnibus:                      520.045   Durbin-Watson:                   2.036
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2150.321
## Skew:                           1.846   Prob(JB):                         0.00
## Kurtosis:                       7.994   Cond. No.                         661.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • The second try : add an interactive covariate smoker:age

X_2['sm_ag'] = X_2.smoker*X_2.age
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# which increase the performance of the model significantly
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             logcharges   R-squared:                       0.825
## Model:                            OLS   Adj. R-squared:                  0.824
## Method:                 Least Squares   F-statistic:                     625.0
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -620.29
## No. Observations:                1338   AIC:                             1263.
## Df Residuals:                    1327   BIC:                             1320.
## Df Model:                          10                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          6.8960      0.072     95.828      0.000       6.755       7.037
## age            0.0416      0.001     48.930      0.000       0.040       0.043
## sex            0.0856      0.021      4.030      0.000       0.044       0.127
## bmi            0.0012      0.002      0.563      0.573      -0.003       0.005
## children       0.1063      0.009     12.106      0.000       0.089       0.124
## smoker         1.2793      0.146      8.769      0.000       0.993       1.566
## northeast      0.1509      0.030      4.953      0.000       0.091       0.211
## northwest      0.0860      0.030      2.827      0.005       0.026       0.146
## southeast      0.0061      0.030      0.202      0.840      -0.053       0.065
## sm_bm          0.0510      0.004     12.130      0.000       0.043       0.059
## sm_ag         -0.0334      0.002    -17.698      0.000      -0.037      -0.030
## ==============================================================================
## Omnibus:                      860.664   Durbin-Watson:                   1.998
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8198.732
## Skew:                           2.969   Prob(JB):                         0.00
## Kurtosis:                      13.574   Cond. No.                         744.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Since we only have two continuous covariates, we can try to give them an extra power.

  • The third try : add \(bmi^{1.5}\)

X_2['bmi^1.5'] = X_2.bmi ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             logcharges   R-squared:                       0.826
## Model:                            OLS   Adj. R-squared:                  0.825
## Method:                 Least Squares   F-statistic:                     574.1
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -614.16
## No. Observations:                1338   AIC:                             1252.
## Df Residuals:                    1326   BIC:                             1315.
## Df Model:                          11                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          5.9730      0.274     21.823      0.000       5.436       6.510
## age            0.0414      0.001     48.940      0.000       0.040       0.043
## sex            0.0859      0.021      4.060      0.000       0.044       0.127
## bmi            0.0928      0.026      3.528      0.000       0.041       0.144
## children       0.1065      0.009     12.175      0.000       0.089       0.124
## smoker         1.2759      0.145      8.782      0.000       0.991       1.561
## northeast      0.1564      0.030      5.147      0.000       0.097       0.216
## northwest      0.0848      0.030      2.797      0.005       0.025       0.144
## southeast      0.0149      0.030      0.497      0.619      -0.044       0.074
## sm_bm          0.0513      0.004     12.243      0.000       0.043       0.060
## sm_ag         -0.0335      0.002    -17.815      0.000      -0.037      -0.030
## bmi^1.5       -0.0110      0.003     -3.494      0.000      -0.017      -0.005
## ==============================================================================
## Omnibus:                      861.038   Durbin-Watson:                   1.995
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8243.481
## Skew:                           2.968   Prob(JB):                         0.00
## Kurtosis:                      13.612   Cond. No.                     4.89e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.89e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
  • The fourth try: add \(age^{1.5}\)

X_2['age^1.5'] = X_2.age ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             logcharges   R-squared:                       0.828
## Model:                            OLS   Adj. R-squared:                  0.827
## Method:                 Least Squares   F-statistic:                     532.2
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -607.51
## No. Observations:                1338   AIC:                             1241.
## Df Residuals:                    1325   BIC:                             1309.
## Df Model:                          12                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          5.5668      0.294     18.907      0.000       4.989       6.144
## age            0.0776      0.010      7.785      0.000       0.058       0.097
## sex            0.0855      0.021      4.060      0.000       0.044       0.127
## bmi            0.0919      0.026      3.508      0.000       0.041       0.143
## children       0.0966      0.009     10.583      0.000       0.079       0.114
## smoker         1.2704      0.145      8.784      0.000       0.987       1.554
## northeast      0.1566      0.030      5.179      0.000       0.097       0.216
## northwest      0.0858      0.030      2.844      0.005       0.027       0.145
## southeast      0.0148      0.030      0.497      0.619      -0.044       0.073
## sm_bm          0.0514      0.004     12.326      0.000       0.043       0.060
## sm_ag         -0.0335      0.002    -17.881      0.000      -0.037      -0.030
## bmi^1.5       -0.0108      0.003     -3.466      0.001      -0.017      -0.005
## age^1.5       -0.0039      0.001     -3.639      0.000      -0.006      -0.002
## ==============================================================================
## Omnibus:                      895.912   Durbin-Watson:                   1.992
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             9255.812
## Skew:                           3.102   Prob(JB):                         0.00
## Kurtosis:                      14.293   Cond. No.                     9.48e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
  • The fifth try: What if we take the log away?

    Caution: this procedure is not a typical one for model selectiom. But if we take off the log, the performance certainly get better.
y_2 = ds0.charges
regr_test = OLS(y_2, add_constant(X_2)).fit()
print(regr_test.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                charges   R-squared:                       0.845
## Model:                            OLS   Adj. R-squared:                  0.843
## Method:                 Least Squares   F-statistic:                     600.3
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -13232.
## No. Observations:                1338   AIC:                         2.649e+04
## Df Residuals:                    1325   BIC:                         2.656e+04
## Df Model:                          12                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const      -7665.5804   3687.192     -2.079      0.038   -1.49e+04    -432.209
## age         -320.3823    124.794     -2.567      0.010    -565.197     -75.567
## sex          509.3387    263.692      1.932      0.054      -7.960    1026.637
## bmi         1056.6806    328.082      3.221      0.001     413.063    1700.298
## children     678.7408    114.266      5.940      0.000     454.579     902.902
## smoker     -2.029e+04   1811.135    -11.200      0.000   -2.38e+04   -1.67e+04
## northeast   1288.7634    378.774      3.402      0.001     545.701    2031.825
## northwest    616.4165    377.748      1.632      0.103    -124.634    1357.467
## southeast    122.5000    374.253      0.327      0.743    -611.693     856.693
## sm_bm       1444.5800     52.238     27.654      0.000    1342.101    1547.059
## sm_ag         -3.7854     23.430     -0.162      0.872     -49.750      42.179
## bmi^1.5     -123.8640     39.080     -3.170      0.002    -200.529     -47.199
## age^1.5       62.3346     13.294      4.689      0.000      36.256      88.413
## ==============================================================================
## Omnibus:                      737.144   Durbin-Watson:                   2.068
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4743.381
## Skew:                           2.581   Prob(JB):                         0.00
## Kurtosis:                      10.645   Cond. No.                     9.48e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

Now we have added all the possible covariates we can, we can consider drop some of them. As we can see from the summary, it seems that the region para is not that important.

X_3 = X_2.drop(columns = ['northwest', 'southeast'])
regr_3 = OLS(y_2, add_constant(X_3)).fit()
print(regr_3.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                charges   R-squared:                       0.844
## Model:                            OLS   Adj. R-squared:                  0.843
## Method:                 Least Squares   F-statistic:                     719.5
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:19   Log-Likelihood:                -13233.
## No. Observations:                1338   AIC:                         2.649e+04
## Df Residuals:                    1327   BIC:                         2.655e+04
## Df Model:                          10                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const      -7462.5827   3670.004     -2.033      0.042   -1.47e+04    -262.941
## age         -322.6577    124.831     -2.585      0.010    -567.546     -77.770
## sex          509.7468    263.786      1.932      0.054      -7.736    1027.230
## bmi         1075.2760    326.981      3.288      0.001     433.821    1716.732
## children     681.9706    114.227      5.970      0.000     457.886     906.056
## smoker     -2.032e+04   1811.634    -11.219      0.000   -2.39e+04   -1.68e+04
## northeast   1036.4906    309.503      3.349      0.001     429.323    1643.658
## sm_bm       1444.4375     52.251     27.644      0.000    1341.934    1546.941
## sm_ag         -3.0043     23.396     -0.128      0.898     -48.902      42.893
## bmi^1.5     -126.7912     38.882     -3.261      0.001    -203.068     -50.514
## age^1.5       62.5812     13.298      4.706      0.000      36.495      88.668
## ==============================================================================
## Omnibus:                      740.495   Durbin-Watson:                   2.062
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4801.892
## Skew:                           2.592   Prob(JB):                         0.00
## Kurtosis:                      10.698   Cond. No.                     9.43e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.43e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

the AIC and BIC actually get smaller.

Similarly smoker:age is not that important either.

X_4 = X_3.drop(columns = ['sm_ag'])
regr_4 = OLS(y_2, add_constant(X_4)).fit()
print(regr_4.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                charges   R-squared:                       0.844
## Model:                            OLS   Adj. R-squared:                  0.843
## Method:                 Least Squares   F-statistic:                     800.1
## Date:                Sun, 02 Dec 2018   Prob (F-statistic):               0.00
## Time:                        20:01:20   Log-Likelihood:                -13233.
## No. Observations:                1338   AIC:                         2.649e+04
## Df Residuals:                    1328   BIC:                         2.654e+04
## Df Model:                           9                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const      -7441.2915   3664.898     -2.030      0.043   -1.46e+04    -251.670
## age         -323.1930    124.715     -2.591      0.010    -567.854     -78.532
## sex          509.8762    263.686      1.934      0.053      -7.411    1027.163
## bmi         1075.1073    326.857      3.289      0.001     433.895    1716.320
## children     681.6865    114.163      5.971      0.000     457.727     905.646
## smoker     -2.043e+04   1630.461    -12.527      0.000   -2.36e+04   -1.72e+04
## northeast   1036.8028    309.378      3.351      0.001     429.879    1643.726
## sm_bm       1443.9491     52.093     27.719      0.000    1341.755    1546.143
## bmi^1.5     -126.7498     38.866     -3.261      0.001    -202.996     -50.504
## age^1.5       62.5735     13.292      4.707      0.000      36.497      88.650
## ==============================================================================
## Omnibus:                      740.467   Durbin-Watson:                   2.063
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4799.401
## Skew:                           2.592   Prob(JB):                         0.00
## Kurtosis:                      10.695   Cond. No.                     9.41e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.41e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

After the model selection part, we now have our residual:

sns.residplot(np.sum(regr_4.params*X_4,1)+regr_4.params[0], y_2)

Which is much more acceptable than before.

For now, the best model would be: charges = constant + age + sex + bmi + children + smokeryes + northeastornot + sm_bm + bmi^1.5 + age^1.5

Caution: As you may see, the coefficient of smokeryes is negative. Please don’t take the result as smoking is good for your health. Since smoke is also used in smoke:bmi, bmi is large. So the bad influence of smoking has been transferred to the smoke:bmi term.

By R

R

Data Analyis: Insurance

The following packages are used in R version of the tutorial.

# libraries--------------------------------------------------------------------
library(faraway)
library(readr)
library(dplyr)

Data Preprocessing

source('./group19.R')

# Load data--------------------------------------------------------------------
ds0 <- read_csv("insurance.csv")
names(ds0)
## [1] "age"      "sex"      "bmi"      "children" "smoker"   "region"  
## [7] "charges"
#have a glimpse of the data
head(ds0)
## # A tibble: 6 x 7
##     age sex      bmi children smoker region    charges
##   <int> <chr>  <dbl>    <int> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
# Check if is there any null---------------------------------------------------
is.na(ds0)

No missing value, we’re good.

# Recode: sex:femal = 1, male = 0, smoke: yes=1, no=0
ds0$sex[ds0$sex == "male"]="0"
ds0$sex[ds0$sex == "female"]="1"
ds0$smoker[ds0$smoker == "no"]="0"
ds0$smoker[ds0$smoker == "yes"]="1"

Data Visualization

# Examining the distribution of each variables---------------------------------
hist(ds0$age,xlab="Age", main="Distribution of Age")

hist(ds0$bmi,xlab="BMI", main="Distribution of BMI")

hist(ds0$children,xlab="Children", main="Distribution of Children")

hist(ds0$charges,xlab="Charges", main="Distribution of Charges")

# Take log for charge since its heavy tail-------------------------------------
ds0$logcharges <- log(ds0$charges+1)
hist(ds0$charges)

ds0$logcharges <- log(ds0$charges+1)
hist(ds0$logcharges, breaks = 10)

Model Dignostics

Firstly, we define dependent variable y and covariates X. In this analysis, the reponse is logcharges and the preditors are age, sex, bmi, children, smoker, and region .

VIF

# Conduct the first regression!
# Since they're all smaller than 5, even smaller than 2.
fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
X <- model.matrix(fit0)[, -1]
round(vif(X),2)
##                        age                       sex1 
##                       1.02                       1.01 
##                        bmi                   children 
##                       1.11                       1.00 
##                    smoker1 as.factor(region)northwest 
##                       1.01                       1.52 
## as.factor(region)southeast as.factor(region)southwest 
##                       1.65                       1.53
hist(fit0$residuals, xlab="Residuals")

plot(fit0$res, xlab="Residuals")
abline(h=0) # acting like noraml so it's good.

Residual plots

plot(fit)

Partial residual plots

# Partial residual plots-------------------------------------------------------
#Which attempts to show how covariate is related to dependent variable
# if we control for the effects of all other covariates
# partial residual plots look acceptable.
fit <- lm(logcharges~ bmi, data=ds0)
plot(fit)

Model Selection

For this part, we would try different adding variables and try to drop variables that are useless. The primary concern in this case is that we have to add variables so that the residual is relatively indep with y.

  • Original model

fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
summary(fit0)
## 
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker + 
##     as.factor(region), data = ds0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07141 -0.19836 -0.04914  0.06601  2.16602 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 6.9556815  0.0732494  94.959  < 2e-16 ***
## age                         0.0345744  0.0008719  39.655  < 2e-16 ***
## sex1                        0.0753873  0.0243966   3.090 0.002043 ** 
## bmi                         0.0133740  0.0020956   6.382 2.41e-10 ***
## children                    0.1018259  0.0100976  10.084  < 2e-16 ***
## smoker1                     1.5541456  0.0302739  51.336  < 2e-16 ***
## as.factor(region)northwest -0.0637739  0.0348992  -1.827 0.067867 .  
## as.factor(region)southeast -0.1571539  0.0350762  -4.480 8.09e-06 ***
## as.factor(region)southwest -0.1289211  0.0350206  -3.681 0.000241 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4442 on 1329 degrees of freedom
## Multiple R-squared:  0.768,  Adjusted R-squared:  0.7666 
## F-statistic: 549.8 on 8 and 1329 DF,  p-value: < 2.2e-16
AIC(fit0)
## [1] 1636.536
BIC(fit0)
## [1] 1688.526
  • The first try: add interactive covariate smoker*bmi

fit1 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi, data=ds0)
summary(fit1)
## 
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker + 
##     as.factor(region) + smoker * bmi, data = ds0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91486 -0.17764 -0.05144  0.04640  2.22282 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.2508346  0.0769477  94.231  < 2e-16 ***
## age                         0.0347880  0.0008427  41.280  < 2e-16 ***
## sex1                        0.0870349  0.0236027   3.688 0.000236 ***
## bmi                         0.0034055  0.0022672   1.502 0.133309    
## children                    0.1031176  0.0097574  10.568  < 2e-16 ***
## smoker1                     0.1562914  0.1459708   1.071 0.284497    
## as.factor(region)northwest -0.0711167  0.0337288  -2.108 0.035176 *  
## as.factor(region)southeast -0.1626838  0.0338962  -4.799 1.77e-06 ***
## as.factor(region)southwest -0.1374810  0.0338491  -4.062 5.16e-05 ***
## bmi:smoker1                 0.0455727  0.0046624   9.775  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4292 on 1328 degrees of freedom
## Multiple R-squared:  0.7835, Adjusted R-squared:  0.7821 
## F-statistic: 534.1 on 9 and 1328 DF,  p-value: < 2.2e-16
BIC(fit1)
## [1] 1602.769
# which certainly improve the performance of the model
  • The second try: add an interactive covariate smoker*age

fit2 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi+as.numeric(smoker)*age, data=ds0)
summary(fit2)
## 
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker + 
##     as.factor(region) + smoker * bmi + as.numeric(smoker) * age, 
##     data = ds0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65395 -0.15404 -0.06305 -0.01126  2.34914 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.0475291  0.0701861 100.412  < 2e-16 ***
## age                         0.0415643  0.0008495  48.928  < 2e-16 ***
## sex1                        0.0855867  0.0212384   4.030 5.90e-05 ***
## bmi                         0.0011518  0.0020440   0.563   0.5732    
## children                    0.1062970  0.0087818  12.104  < 2e-16 ***
## smoker1                     1.2788788  0.1458651   8.768  < 2e-16 ***
## as.factor(region)northwest -0.0649078  0.0303520  -2.139   0.0327 *  
## as.factor(region)southeast -0.1448205  0.0305173  -4.746 2.30e-06 ***
## as.factor(region)southwest -0.1508956  0.0304676  -4.953 8.26e-07 ***
## as.numeric(smoker)                 NA         NA      NA       NA    
## bmi:smoker1                 0.0510336  0.0042067  12.132  < 2e-16 ***
## age:as.numeric(smoker)     -0.0333924  0.0018870 -17.696  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3862 on 1327 degrees of freedom
## Multiple R-squared:  0.8249, Adjusted R-squared:  0.8235 
## F-statistic:   625 on 10 and 1327 DF,  p-value: < 2.2e-16
BIC(fit2)
## [1] 1326.493
# which increase the performance of the model significantly

Since we only have two continuous covariates, we can try to give them an extra power.

  • The third try: add \(bmi^{1.5}\)

 #since we only have two continuous covariates, we can try to give them an extra power, add bmi^1.5.
fit3 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+
            smoker*bmi,as.numeric(smoker)*age+bmi^1.5, data=ds0)
summary(fit3)
## 
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker + 
##     as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) * 
##     age + bmi^1.5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03798 -0.17402 -0.03470  0.05657  2.04224 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.1763930  0.0720456  99.609  < 2e-16 ***
## age                         0.0375651  0.0007998  46.967  < 2e-16 ***
## sex1                        0.1190553  0.0224451   5.304 1.32e-07 ***
## bmi                        -0.0015026  0.0023231  -0.647   0.5179    
## children                    0.1225808  0.0093304  13.138  < 2e-16 ***
## smoker1                     0.0513262  0.1383354   0.371   0.7107    
## as.factor(region)northwest  0.0100724  0.0315101   0.320   0.7493    
## as.factor(region)southeast -0.0649199  0.0337726  -1.922   0.0548 .  
## as.factor(region)southwest -0.1401404  0.0330701  -4.238 2.41e-05 ***
## bmi:smoker1                 0.0495246  0.0045698  10.837  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4012 on 1328 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:  0.8064 
## F-statistic: 619.8 on 9 and 1328 DF,  p-value: < 2.2e-16
BIC(fit3)
## [1] 1422.136
  • The fourth try: add \(age^{1.5}\)

fit4 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+
            smoker*bmi,as.numeric(smoker)*age+bmi^1.5+age^1.5, data=ds0)
summary(fit4)
## 
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker + 
##     as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) * 
##     age + bmi^1.5 + age^1.5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91636 -0.18144 -0.05810  0.02661  2.13811 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.2645976  0.0841646  86.314  < 2e-16 ***
## age                         0.0347150  0.0008966  38.720  < 2e-16 ***
## sex1                        0.1085825  0.0249587   4.350 1.46e-05 ***
## bmi                         0.0027220  0.0024320   1.119   0.2632    
## children                    0.0792971  0.0107489   7.377 2.84e-13 ***
## smoker1                     0.1785370  0.1539089   1.160   0.2463    
## as.factor(region)northwest -0.0821657  0.0375473  -2.188   0.0288 *  
## as.factor(region)southeast -0.0423219  0.0348873  -1.213   0.2253    
## as.factor(region)southwest -0.0817111  0.0350003  -2.335   0.0197 *  
## bmi:smoker1                 0.0419282  0.0048774   8.596  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4515 on 1328 degrees of freedom
## Multiple R-squared:  0.7465, Adjusted R-squared:  0.7447 
## F-statistic: 434.4 on 9 and 1328 DF,  p-value: < 2.2e-16
BIC(fit4)
## [1] 1738.278
  • The fifth try: What if we take the log away? so we use charges instead of logcharges for the response.

fit5 <-lm(charges ~age+sex+bmi+children+smoker+as.factor(region)+
            smoker*bmi,as.numeric(smoker)*age+bmi^1.5+age^1.5, data=ds0)
summary(fit5)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) * 
##     age + bmi^1.5 + age^1.5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9501.1 -2050.8 -1444.9  -458.8 24540.4 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1730.677    911.048  -1.900 0.057695 .  
## age                           257.277      9.705  26.510  < 2e-16 ***
## sex1                         1003.023    270.168   3.713 0.000214 ***
## bmi                            -3.801     26.325  -0.144 0.885205    
## children                      238.749    116.353   2.052 0.040372 *  
## smoker1                    -20194.774   1666.003 -12.122  < 2e-16 ***
## as.factor(region)northwest   -921.164    406.435  -2.266 0.023584 *  
## as.factor(region)southeast   -335.442    377.641  -0.888 0.374564    
## as.factor(region)southwest   -641.983    378.864  -1.694 0.090406 .  
## bmi:smoker1                  1422.834     52.796  26.949  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4887 on 1328 degrees of freedom
## Multiple R-squared:  0.8247, Adjusted R-squared:  0.8235 
## F-statistic: 694.2 on 9 and 1328 DF,  p-value: < 2.2e-16
BIC(fit5)
## [1] 26597.19

Caution: this procedure is not a typical one for model selectiom. But if we take off the log, the performance certainly get better.

Now we have added all the possible covariates we can, we can consider drop some of them. As we can see from the summary, it seems that the region parameter is not that important, so I’m only keep northeast in region column.

new_ds0 <- ds0[which(ds0$region=="northeast"),]
#Recode: sex:femal = 1, male = 0, smoke: yes=1, no=0
new_ds0$sex[new_ds0$sex == "male"]="0"
new_ds0$sex[new_ds0$sex == "female"]="1"
new_ds0$smoker[new_ds0$smoker == "no"]="0"
new_ds0$smoker[new_ds0$smoker == "yes"]="1"
# Now fit the model with only region is northeast.
fit6 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+
             as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=new_ds0)
summary(fit6)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     smoker * bmi + as.numeric(smoker) * age + I(bmi^1.5) + I(age^1.5), 
##     data = new_ds0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9406.3 -2174.0 -1538.2  -518.7 21804.4 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -5132.11    8108.96  -0.633  0.52726    
## age                       -99.62     266.41  -0.374  0.70870    
## sex1                      792.23     563.31   1.406  0.16060    
## bmi                       577.85     748.29   0.772  0.44056    
## children                  716.01     247.08   2.898  0.00402 ** 
## smoker1                -18454.30    3635.64  -5.076 6.62e-07 ***
## as.numeric(smoker)            NA         NA      NA       NA    
## I(bmi^1.5)                -58.07      91.07  -0.638  0.52417    
## I(age^1.5)                 36.39      28.46   1.279  0.20201    
## bmi:smoker1              1444.18     115.75  12.477  < 2e-16 ***
## age:as.numeric(smoker)    -45.95      52.15  -0.881  0.37901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4997 on 314 degrees of freedom
## Multiple R-squared:  0.8084, Adjusted R-squared:  0.8029 
## F-statistic: 147.2 on 9 and 314 DF,  p-value: < 2.2e-16
AIC(fit6)
## [1] 6450.074
BIC(fit6)
## [1] 6491.662

The AIC and BIC actually get smaller.

Similarly smoker*age is not that important either.

fit7 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+region+I(bmi^1.5)+I(age^1.5),data=new_ds0)
summary(fit7)
AIC(fit7)
BIC(fit7)
summary(fit7)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     smoker * bmi + region + I(bmi^1.5) + I(age^1.5), data = new_ds0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11786.7  -1716.1  -1351.6   -665.2  30838.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7441.29    3664.90  -2.030 0.042513 *  
## age           -323.19     124.72  -2.591 0.009662 ** 
## sex1           509.88     263.69   1.934 0.053369 .  
## bmi           1075.11     326.86   3.289 0.001031 ** 
## children       681.69     114.16   5.971 3.02e-09 ***
## smoker1     -20425.25    1630.46 -12.527  < 2e-16 ***
## region1       1036.80     309.38   3.351 0.000827 ***
## I(bmi^1.5)    -126.75      38.87  -3.261 0.001138 ** 
## I(age^1.5)      62.57      13.29   4.707 2.77e-06 ***
## bmi:smoker1   1443.95      52.09  27.719  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4795 on 1328 degrees of freedom
## Multiple R-squared:  0.8443, Adjusted R-squared:  0.8432 
## F-statistic: 800.1 on 9 and 1328 DF,  p-value: < 2.2e-16
AIC(fit7)
## [1] 26488.94
BIC(fit7)
## [1] 26546.13

After dropping smokerage, AIC and BIC values stay the same.For now, the best model would be: charges = constant + age + sex + bmi + children + smoker + region + bmismoker + bmi^1.5 + age^1.5

Caution: As you may see, the coefficient of smoker1 is negative. Please don’t take the result as smoking is good for your health. Since smoke is also used in smoke:bmi, bmi is large. So the bad influence of smoking has been transferred to the smoke:bmi term.

By Matlab

Matlab

(Yijia)